Compact q-Gram Profiling of Compressed Strings

Authors

  • Philip Bille
  • Patrick Hagge Cording
  • Inge Li Gørtz
Abstract

We consider the problem of computing the q-gram profile of a string T of size N compressed by a context-free grammar with n production rules. We present an algorithm that runs in $O(N - \alpha)$ expected time and uses $O(n + k_{T,q})$ space, where $N - \alpha \le qn$ is the exact number of characters decompressed by the algorithm and $k_{T,q} \le N - \alpha$ is the number of distinct q-grams in T. This simultaneously matches the current best known time bound and improves the best known space bound. Our space bound is asymptotically optimal in the sense that any algorithm storing the grammar and the q-gram profile must use $\Omega(n + k_{T,q})$ space. To achieve this we introduce the q-gram graph, which space-efficiently captures the structure of a string with respect to its q-grams, and show how to construct it from a grammar.
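To make the problem statement concrete, the following is a minimal baseline sketch in Python, not the algorithm of the paper: it fully decompresses a small grammar and counts q-grams with a hash table, so it touches all N characters instead of at most qn. The grammar encoding (a dict mapping each rule name to either a single terminal character or a pair of rule names) and the names `expand` and `qgram_profile` are assumptions made for illustration.

```python
from collections import Counter

def expand(rules, start):
    """Decompress the string derived from `start` (iteratively, to avoid deep recursion)."""
    out, stack = [], [start]
    while stack:
        rhs = rules[stack.pop()]
        if isinstance(rhs, str):          # terminal rule: a single character
            out.append(rhs)
        else:                             # binary rule: push right first so left is expanded first
            stack.extend(reversed(rhs))
    return "".join(out)

def qgram_profile(text, q):
    """Map every distinct substring of length q to its number of occurrences."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))

# Hypothetical example grammar deriving "abababab".
rules = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),
    "X2": ("X1", "X1"),
    "X3": ("X2", "X2"),
}
text = expand(rules, "X3")
profile = qgram_profile(text, 2)
```

Here the profile for q = 2 has two distinct entries, so $k_{T,q} = 2$ for this toy input, while the baseline still reads every character of the decompressed string.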


Similar resources

Algorithms and data structures for grammar-compressed strings

This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, $\mathcal{S}$ is an SLP of size n compressing a string S of size N. We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...


Data Structures for Grammar-compressed Strings

This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, $\mathcal{S}$ is an SLP of size n compressing a string S of size N. We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...


Fast q-gram Mining on SLP Compressed Strings

We present simple and efficient algorithms for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size n that represents string T, we present an O(qn) time and space algorithm that computes the occurrence frequencies of all q-grams in T. Computational experiments show that our algorithm and its variation are pract...
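One way to count q-grams without decompressing the whole string can be sketched as follows. This is an illustrative reconstruction, not code from the cited paper. It assumes the usual boundary-crossing idea: every q-gram occurrence (for q ≥ 2) crosses the boundary between the two children of exactly one node in the derivation tree, so it suffices to scan a window of at most 2(q − 1) characters per rule and weight it by how often that rule occurs in the tree. The grammar encoding and the name `slp_qgram_profile` are assumptions, and rules are assumed to be listed bottom-up (each rule defined before it is used).

```python
from collections import Counter

def slp_qgram_profile(rules, start, q):
    """Count q-grams of the string derived from `start` using per-rule windows."""
    k = q - 1
    pre, suf = {}, {}                     # first / last k characters of each rule's expansion
    for name, rhs in rules.items():       # bottom-up pass
        s = rhs if isinstance(rhs, str) else None
        if s is None:
            left, right = rhs
            head = pre[left] + pre[right]
            tail = suf[left] + suf[right]
        else:
            head = tail = s
        pre[name] = head[:k]
        suf[name] = tail[max(0, len(tail) - k):]

    occ = Counter({start: 1})             # occurrences of each rule in the derivation tree
    profile = Counter()
    for name in reversed(list(rules)):    # top-down pass: parents before children
        rhs, weight = rules[name], occ[name]
        if weight == 0:
            continue                      # rule not reachable from `start`
        if isinstance(rhs, str):
            if q == 1:                    # 1-grams never cross a boundary; count leaves
                profile[rhs] += weight
            continue
        left, right = rhs
        occ[left] += weight
        occ[right] += weight
        window = suf[left] + pre[right]   # every q-gram in it crosses the boundary
        for i in range(len(window) - q + 1):
            profile[window[i:i + q]] += weight
    return profile

# Hypothetical example grammar deriving "abababab".
rules = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),
    "X2": ("X1", "X1"),
    "X3": ("X2", "X2"),
}
profile2 = slp_qgram_profile(rules, "X3", 2)
profile3 = slp_qgram_profile(rules, "X3", 3)
```

Each rule contributes a window of length at most 2(q − 1), so the sketch inspects on the order of qn characters in total, in line with the bound quoted above. Whether the authors' implementation is organized this way is not stated in the excerpt.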


QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings

Q-gram (or n-gram, k-mer) models are used in many research areas, e.g. in computational linguistics for statistical natural language processing, in computer science for approximate string searching, and in computational biology for sequence analysis and data compression. For a collection of N strings, one usually creates a separate positional q-gram index structure for each string, or at least ...


Speeding Up q-Gram Mining on Grammar-Based Compressed Texts

We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP $\mathcal{T}$ of size n that represents string T, the algorithm computes the occurrence frequencies of all q-grams in T, by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size m = |T | − dup(q, T...




Publication date: 2013